Using Decision Trees and Support Vector Machines to Classify Genes by Names

Authors

  • Simon Lin
  • Sandip Patel
  • Andrew Duncan
  • Linda Goodwin
Abstract

In this paper we report an application of machine learning methods to classify gene names into two categories: known and unknown ones. We acquired a data set of 1,624 genes by having a human expert classify them manually. To capture the knowledge behind the classification, we also asked the expert to derive a set of rules. In parallel, we trained two machine learners to capture the same knowledge. Both decision trees (CART) and Support Vector Machines (SVMs) outperform the expert rules; the cross-validated error rates are below 1%, and the area under the Receiver Operating Characteristic (ROC) curve exceeds 0.99. In summary, CART and SVMs reduced the overall prediction error rate by 40% and 88%, respectively, compared with classification using the expert rule set. In addition, the machine classifiers are able to find some errors made by the human expert himself. Finally, we used the expert system to classify 7,447 genes on the Affymetrix U74A microarray chip. Results show that 70% of the genes on this chip are known ones. In conclusion, we successfully demonstrate that the machine-derived classifiers handle the job more capably and efficiently than the expert-derived classifier. This further supports the idea that in many application domains, experts can perform the task but cannot tell how, whereas expert systems are able to capture the knowledge from the experts.
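The two evaluation quantities named in the abstract, the area under the ROC curve and the relative reduction in error rate, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `roc_auc` and `relative_error_reduction`, the toy labels and scores, and the error rates are hypothetical and are not taken from the paper.

```python
from itertools import product


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def relative_error_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy data: 1 = "known" gene, 0 = "unknown" gene.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.4, 0.2, 0.1]

auc = roc_auc(labels, scores)
# Hypothetical error rates, chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(0.05, 0.03)
print(f"AUC = {auc:.4f}, relative error reduction = {reduction:.0%}")
```

The pairwise formulation is quadratic in the number of examples, which is acceptable for a data set of a few thousand names; a rank-based computation would be the usual choice for larger sets.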


Similar resources

Application of Data Mining Algorithms to Discriminating Sediment Sources in the Nodeh Gonabad Watershed

Introduction: Reducing sediment supply requires the implementation of soil conservation and sediment control programs in the form of watershed management plans. Sediment control programs require identifying the relative importance of sediment sources, ascribing them quantitatively, and identifying critical areas within the watersheds. The sediment source ascription involves two...


Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine

DNA microarray gene expression gives access to a wealth of information on thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and of tumor differences. In this study we try to reduce high-dimensional data by a statistical method to select valuable, high-impact genes as biomarkers and then classify ovarian tumors based on gene expression data of...


Machine Learning Using for Classification of Heart Failure

Physicians classify patients into those with or without a specific disease. Classification trees are frequently used to classify patients according to the presence or absence of a disease. In data mining and machine learning, alternative classification schemes have been ...


Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM

Structural repetitive subsequences are the most important portion of biological sequences and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes based on contexts’ structures. Much research has been performed to critically determine the structure and function of repetitiv...


Anomaly Detection Using SVM as Classifier and Decision Tree for Optimizing Feature Vectors

With the advancement and development of computer network technologies, the way for intruders has become smoother; therefore, the importance of intrusion detection systems (IDS) as a key element of security in detecting threats and attacks is increasing. One of the challenges of intrusion detection systems is managing the large number of network traffic features. Removing un...




Publication date: 2003